2024-10-22

PAPER

methods: how do different text mining approaches spotlight different historical trends/fluctuations/moments in labor-environmental (labor? environmental?) issues since the 1960s? (i.e., compare the methods)

  • period distinctiveness (tf-ipf)
  • textual averages (triples)
  • convergence/divergence (topic models + word co-occurrence networks)

UPDATES

  • re-subset speeches with new set of environmental + labor keywords (two versions)
  • re-ran speech counts, token counts
  • re-ran tf-ipf code (this time labor speech data was small enough to work)
  • workshopping triples code, which is giving me some trouble. i’ll move on to topic models + co-occurrence networks for next week, for the sake of previewing results beyond tf-ipf.

KEYWORDS


how to cast the widest possible net for “env’t” and “labor” without capturing anything beyond env’t and labor: the fewest, vaguest possible terms that still get us what we want. i.e., capturing the most expansive umbrella of “env’t” and “labor” without:

  • casting too wide a net and capturing non-environmental and non-labor speeches
  • overdetermining the issues/terms associated with each

i.e., my analysis should tell me when something like “globalization” is articulated specifically as a labor issue or “urbanization” is articulated specifically as an environmental issue, without preemptively compiling ALL globalization or urbanization speeches into the enviro-labor speeches dataset.
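
a minimal sketch of the subsetting logic described above, not the actual pipeline. assumptions: speeches are plain strings; the keyword lists are just the two v2 terms noted under TF-IPF; the names (ENVIRO_KEYWORDS, compile_keywords, flag_speech) are hypothetical; and whether enviro-labor speeches also stay in the single-issue subsets is an open choice (here they do).

```python
import re

# hypothetical keyword lists; swap in the real v1/v2 lists
ENVIRO_KEYWORDS = ["environmental"]
LABOR_KEYWORDS = ["labor"]

def compile_keywords(keywords):
    """Build one case-insensitive, whole-word pattern from a keyword list."""
    alternation = "|".join(re.escape(k) for k in keywords)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)

def flag_speech(text, enviro_pat, labor_pat):
    """Flag which subset(s) a speech belongs to, based only on keyword hits."""
    has_env = bool(enviro_pat.search(text))
    has_lab = bool(labor_pat.search(text))
    return {
        "enviro": has_env,
        "labor": has_lab,
        # assumption: enviro-labor means the speech matches both lists
        "enviro_labor": has_env and has_lab,
    }

enviro_pat = compile_keywords(ENVIRO_KEYWORDS)
labor_pat = compile_keywords(LABOR_KEYWORDS)
```

with whole-word matching, a speech that only mentions “globalization” lands in none of the subsets unless it also uses a labor keyword, which is the behavior wanted above. the tradeoff: “labor” will not match “laborers”, so stems or variants would have to be listed explicitly (or the `\b` boundaries loosened, at the cost of a wider net).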


environmental/labor keywords v1: specific

environmental v1: specific


labor v1: specific


environmental/labor keywords v2: broad

environmental v2: broad


labor v2: broad


SPEECHES

environmental/labor speeches: v1

enviro speeches v1 (sample of n=1 per year):

labor speeches v1 (sample of n=1 per year):

enviro-labor speeches v1 (sample of n=1 per year):

environmental/labor speeches: v2

speeches per year v2:

enviro speeches v2 (sample of n=1 per year):

labor speeches v2 (sample of n=1 per year):

enviro-labor speeches v2 (sample of n=1 per year):


TOKENS

top 25 bigrams:

enviro v1
enviro v2
labor v1
labor v2
enviro-labor v1
enviro-labor v2
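
for reference, a rough sketch of how the top-n bigram counts could be computed for any one subset. assumptions: a simple regex tokenizer, an optional stopword set, and bigrams counted within each speech only (never across speech boundaries). `tokenize` and `top_bigrams` are hypothetical names, not the code actually used for these results.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into alphabetic tokens (apostrophes kept)."""
    return re.findall(r"[a-z']+", text.lower())

def top_bigrams(speeches, n=25, stopwords=frozenset()):
    """Return the n most frequent adjacent token pairs across speeches."""
    counts = Counter()
    for text in speeches:
        # note: dropping stopwords first makes some non-adjacent words adjacent
        tokens = [t for t in tokenize(text) if t not in stopwords]
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)
```

the by-year versions would be the same call applied to each year's speeches separately, with `n=10`.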

top 10 tokens by year:

enviro v1
enviro v2


labor v1
labor v2


enviro-labor v1
enviro-labor v2

top 10 bigrams by year:

enviro v1
enviro v2


labor v1
labor v2


enviro-labor v1
enviro-labor v2

TF-IPF

(run using v2 keywords, “environmental” and “labor”)
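
a sketch of one common way to compute tf-ipf (term frequency times inverse period frequency), for comparison with the period results below. these notes don't pin down the exact formula, so this is an assumption: tf = a term's share of all tokens in the period, ipf = log(number of periods / number of periods containing the term). the function names, the `(year, tokens)` input shape, and the 1960 start year are also hypothetical.

```python
import math
from collections import Counter, defaultdict

def period_label(year, width, start=1960):
    """Map a year to a fixed-width period label, e.g. 1973 -> '1970-1979'."""
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"

def tf_ipf(records, width=10, start=1960):
    """records: iterable of (year, tokens). Returns {period: {term: score}}."""
    # pool token counts for all speeches falling in the same period
    period_counts = defaultdict(Counter)
    for year, tokens in records:
        period_counts[period_label(year, width, start)].update(tokens)

    # for each term, count how many periods it appears in
    n_periods = len(period_counts)
    periods_with_term = Counter()
    for counts in period_counts.values():
        periods_with_term.update(counts.keys())

    scores = {}
    for period, counts in period_counts.items():
        total = sum(counts.values())
        scores[period] = {
            term: (c / total) * math.log(n_periods / periods_with_term[term])
            for term, c in counts.items()
        }
    return scores
```

changing `width` to 20, 10, or 5 gives the three period lengths. under this formula a term that shows up in every period scores 0, so shorter periods (more bins) tend to leave more terms with nonzero scores.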

enviro tf-ipf

20-yr periods:


10-yr periods:


5-yr periods:

labor tf-ipf

20-yr periods:


10-yr periods:


5-yr periods:

enviro-labor tf-ipf

20-yr periods:


10-yr periods:


5-yr periods:

NEXT STEPS

  • CLC presentation thursday (5 mins, 2-4 slides)
  • subsetting:
    • final enviro/labor keywords?
    • figure out encoding problem
  • analysis:
    • triples
    • topic modeling
  • validation:
    • are there meaningful differences between daily vs. bound speeches?
    • validating scraped 2016-2024 data cleaning/processing against stanford 1873-2016 data cleaning/processing